Abstract

With the development of convolutional neural networks (CNNs),
more and more researchers have focused on the advantages of CNNs
for the face recognition task. In this paper, we propose a deep
convolutional network for learning a robust face representation.
The network consists of 4 convolution layers, 4 max pooling layers
and 2 fully connected layers, and contains about 4M parameters in
total. The Max-Feature-Map activation function is used instead of
ReLU, because ReLU may lose information due to its sparsity,
whereas Max-Feature-Map yields compact and discriminative feature
vectors. The model is trained on the CASIA-WebFace dataset and
evaluated on the LFW dataset. A single net achieves 97.77% on LFW
under the unsupervised setting.

1. Introduction 

In the past years, with the development of convolutional neural
networks, numerous vision tasks have benefited from compact
representations learned from image data by deep models. Performance
in various computer vision applications, such as image
classification [3], object detection [14] and face recognition
[11] [15] [16], has made great progress.

For the face verification task, the accuracy on LFW, a hard
benchmark dataset, has been improved from 97% [15] to 99% [11] in
recent years by deep learning models. The main frameworks for face
verification are based on multi-class classification [15] [16] to
extract face feature vectors, which are then further processed by
classifiers or patch model ensembles. However, probabilistic models
such as Joint Bayesian [1] and Gaussian Process [?] rely on strong
assumptions which may not hold in various situations. Other methods
[5] [10] optimize a verification loss directly on matching and
non-matching pairs. The disadvantages of these verification-based
methods are that it is difficult to select negative pairs for the
training set and that the threshold in the verification loss
function is set manually. Moreover, a joint identification and
verification constraint is used to optimize the deep face model in
[11], where it is also difficult to set the trade-off parameter
between the identification and verification losses for multi-task
optimization.

In this paper, we propose a deep robust face representation
learning framework. We use convolution networks and propose a
Max-Feature-Map activation function. The model is trained on the
CASIA-WebFace dataset¹ and evaluated on the LFW dataset.

The contributions of this paper are summarized as follows:

(1) We propose a Max-Feature-Map activation function whose values
are not sparse, while its gradients are sparse instead. The
activation function can also be treated as a sparse connection
that learns a robust representation for the deep model.

(2) We build a shallower single convolution network and obtain
better performance than DeepFace [15], DeepID2 [11] and
WebFace [16].

The paper is organized as follows. Section 2 briefly describes our
convolution network framework and the Max-Feature-Map activation
function. We present our experimental results in Section 3 and
conclude in Section 4.

2. Architecture 

In this section, we describe the framework of our deep 
face representation model and the compact Max-Feature- 
Map activation function. 

2.1. Compact Activation Function 

Sigmoid or tanh is a nonlinear activation for neural networks and
often leads to robust optimization during DNN training [?].
However, it may suffer from vanishing gradients: when higher-layer
units are nearly saturated at -1 or 1, the gradients of the lower
layers are close to 0. Vanishing gradients may lead to slow
convergence or poor local optima.
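
As a brief worked illustration of the saturation effect (a standard
identity, not part of the original derivation), the derivative of
tanh is

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x).$$

For a unit with pre-activation x = 3, tanh(3) is about 0.995, so the
local derivative is about 0.0099. Back-propagation multiplies the
gradient by this factor at each saturated unit, which is why the
gradients that reach the lower layers can become very small.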

¹ http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
Figure 1. Operation performed by the Max-Feature-Map activation
function.


To overcome vanishing gradients, the rectified linear unit (ReLU)
[?] offers a sparse representation. However, ReLU is at a potential
disadvantage during optimization because its output is 0 whenever
the unit is not active. This may lose information, especially in
the first several convolution layers, because these layers behave
like Gabor filters, for which both positive and negative responses
matter. To alleviate this problem, PReLU was proposed, and it works
well on the ImageNet classification task [?].

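To make the representation compact rather than sparse as in ReLU,
we propose the Max-Feature-Map (MFM) activation function, which is
inspired by [?]. Given an input convolution layer
$C \in \mathbb{R}^{h \times w \times 2n}$, as shown in Fig. 1, the
Max-Feature-Map activation function can be written as

$$f_{ij}^{k} = \max\left(C_{ij}^{k},\, C_{ij}^{k+n}\right), \qquad 1 \le k \le n,$$

where the number of convolution feature maps in C is 2n, and
$1 \le i \le h$, $1 \le j \le w$. The gradient of this activation
function can be shown as

$$\frac{\partial f_{ij}^{k}}{\partial C_{ij}^{k'}} =
\begin{cases}
1, & \text{if } C_{ij}^{k'} = \max\left(C_{ij}^{k},\, C_{ij}^{k+n}\right) \\
0, & \text{otherwise}
\end{cases}
\qquad k' \in \{k,\, k+n\}.$$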

Unlike sigmoid or ReLU, the Max-Feature-Map activation function is
not a usual single-input, single-output function; it takes the
maximum of two candidate nodes from the convolution feature maps.
This activation function not only selects the competitive node
among the convolution candidates, but also sets 50% of the
gradients of the convolution layer to 0. Moreover, the
Max-Feature-Map activation function can also be treated as a sparse
connection between two convolution layers, which encodes the
information sparsely onto a feature space.
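
The following is a minimal NumPy sketch of the Max-Feature-Map
operation described above, not the released implementation. It
assumes that feature map k is paired with feature map k + n, that
ties go to the first half, and that the layout is (h, w, 2n); the
function names mfm_forward and mfm_backward are hypothetical.

```python
import numpy as np


def mfm_forward(c):
    """Max-Feature-Map over the channel axis.

    c has shape (h, w, 2n). Returns (f, mask), where f has shape
    (h, w, n) and mask is True where the first half supplied the maximum.
    """
    channels = c.shape[2]
    if channels % 2 != 0:
        raise ValueError("MFM needs an even number of feature maps")
    n = channels // 2
    first, second = c[:, :, :n], c[:, :, n:]
    # Assumption: ties are resolved in favour of the first half.
    mask = first >= second
    f = np.where(mask, first, second)
    return f, mask


def mfm_backward(grad_f, mask):
    """Route the upstream gradient to the winning feature map only."""
    grad_first = np.where(mask, grad_f, 0.0)
    grad_second = np.where(mask, 0.0, grad_f)
    return np.concatenate([grad_first, grad_second], axis=2)


# One pixel with 2n = 4 feature maps.
c = np.array([[[1.0, -2.0, 0.5, 3.0]]])
f, mask = mfm_forward(c)
grad_c = mfm_backward(np.ones_like(f), mask)
print(f)       # the two winning responses
print(grad_c)  # exactly half of the entries are zero
```

The output keeps negative responses when they win the comparison,
while exactly one of each pair of input maps receives a gradient,
which matches the "50% of the gradients are 0" property stated above.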

2.2. Convolution Network Framework 


Table 1. The architecture of the proposed deep face convolution
network.

Name      Type              Filter Size/Stride   Output Size
--------  ----------------  -------------------  ---------------
input     -                 -                    144 x 144 x 1
crop      -                 -                    128 x 128 x 1
conv1_1   convolution       9 x 9 / 1            120 x 120 x 48
conv1_2   convolution       9 x 9 / 1            120 x 120 x 48
mfm1      MFM               -                    120 x 120 x 48
pool1     max pooling       2 x 2 / 2            60 x 60 x 48
conv2_1   convolution       5 x 5 / 1            56 x 56 x 96
conv2_2   convolution       5 x 5 / 1            56 x 56 x 96
mfm2      MFM               -                    56 x 56 x 96
pool2     max pooling       2 x 2 / 2            28 x 28 x 96
conv3_1   convolution       5 x 5 / 1            24 x 24 x 128
conv3_2   convolution       5 x 5 / 1            24 x 24 x 128
mfm3      MFM               -                    24 x 24 x 128
pool3     max pooling       2 x 2 / 2            12 x 12 x 128
conv4_1   convolution       4 x 4 / 1            9 x 9 x 192
conv4_2   convolution       4 x 4 / 1            9 x 9 x 192
mfm4      MFM               -                    9 x 9 x 192
pool4     max pooling       2 x 2 / 2            5 x 5 x 192
fc1       fully connected   -                    256
fc2       fully connected   -                    10575
loss      softmax           -                    10575
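
The spatial output sizes listed in Table 1 can be reproduced with a
few lines of arithmetic. The sketch below assumes unpadded
convolutions and pooling output sizes rounded up, as in Caffe's
pooling layer; the helper names conv_out and pool_out are
hypothetical.

```python
import math


def conv_out(size, kernel, stride=1):
    # Unpadded ("valid") convolution.
    return (size - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    # Assumption: ceil rounding, which maps 9 x 9 to 5 x 5 for pool4.
    return math.ceil((size - kernel) / stride) + 1


# (kernel size, number of maps after MFM) for the four stages of Table 1.
stages = [(9, 48), (5, 96), (5, 128), (4, 192)]

size = 128  # side length of the cropped input
shapes = []
for kernel, maps in stages:
    size = conv_out(size, kernel)
    shapes.append(("conv+mfm", size, maps))
    size = pool_out(size)
    shapes.append(("pool", size, maps))

for name, side, maps in shapes:
    print(f"{name:9s} {side} x {side} x {maps}")

# Number of inputs to fc1: 5 * 5 * 192.
fc1_inputs = size * size * stages[-1][1]
print("fc1 inputs:", fc1_inputs)
```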


The deep face convolution network is constructed from four
convolution layers, 4 max pooling layers, Max-Feature-Map
activation functions and 2 fully connected layers, as shown in
Fig. 2.

The input is a 144 x 144 gray-scale face image from the
CASIA-WebFace dataset. The detailed parameter settings are
presented in Table 1. We randomly crop each input image to a
128 x 128 patch, which is the input of the first convolution layer.
The network includes 4 convolution layers, each of which combines
two independent convolution parts computed from the same input. The
Max-Feature-Map activation function and a max pooling layer follow.
The fc1 layer is a 256-dimensional face representation, since face
images are usually considered to lie on a low-dimensional manifold
and a low dimension also reduces the complexity of the
convolutional neural network. The fc2 layer is the input of the
softmax cost function and its size is set to the number of WebFace
identities (10575). In addition, the proposed network has 4153K
parameters, which is fewer than the DeepFace and WebFace nets.

Figure 2. An illustration of the architecture of our deep face
convolution network model.

3. Experiments

3.1. Data Pre-processing

The CASIA-WebFace dataset is used to train our deep face
convolution network. It contains 493456 face images of 10575
identities. All face images are converted to gray-scale and
normalized to 144 x 144 according to landmarks, as shown in
Fig. 3(a). The 5 facial points are extracted by [13] and manually
revised. Because the distance between the midpoint of the eyes and
the midpoint of the mouth is relatively invariant to pose
variations in yaw angle, it is fixed to 50 pixels. We also rotate
the two eye points to be horizontal to compensate for pose
variations in roll angle. The normalized face image is shown in
Fig. 3(b).

Figure 3. Face image alignment for the WebFace dataset. (a) The
facial point detection results; (b) the normalized face image.

3.2. Training Methodology 

To train the convolution network, we randomly select one face
image from each identity for the validation set and use the
remaining images as the training set. The open source deep learning
framework Caffe [?] is used for training the model.

The input of the convolution network is the 144 x 144 gray-scale
face image; we crop the input image to 128 x 128 and mirror it.
These data augmentation methods improve the generalization of the
convolutional neural network and reduce overfitting [7]. Dropout is
also used for the fully connected layer, with the ratio set to 0.7.
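
The crop-and-mirror augmentation can be sketched in NumPy as
follows. This is an illustration only: the uniform choice of the
crop position and the mirroring probability of 0.5 are assumptions,
and random_crop_and_mirror is a hypothetical name rather than a
Caffe function.

```python
import numpy as np


def random_crop_and_mirror(image, crop=128, rng=None):
    """Return a random crop x crop patch of a 2-D gray-scale image,
    flipped horizontally with probability 0.5."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    # Upper bound is exclusive, so the patch always fits in the image.
    top = int(rng.integers(0, height - crop + 1))
    left = int(rng.integers(0, width - crop + 1))
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal mirror
    return patch


# A synthetic 144 x 144 "image" whose values encode pixel positions.
image = np.arange(144 * 144, dtype=np.float32).reshape(144, 144)
patch = random_crop_and_mirror(image, rng=np.random.default_rng(0))
print(patch.shape)
```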

Moreover, the weight decay is set to 5e-4 for the convolution
layers and the fully connected layers, except the fc2 layer. The
fc1 face representation is only used for face verification, which
differs from image classification and object detection tasks.
However, the fc2 layer has a very large number of parameters, so
learning this large fully connected layer may lead to overfitting.
To counter this, we set the weight decay of the fc2 layer to 5e-3.

The learning rate is initially set to 1e-3 and is gradually reduced
to 5e-5. The convolution parameters are initialized with Xavier
initialization, and Gaussian initialization is used for the fully
connected layers. The deep model is trained on a GTX980 for 2
million iterations.

3.3. Results on LFW benchmark 

The evaluation is performed on the LFW dataset² in detail. The LFW
dataset contains 13233 images of 5749 people for face verification.
All the images in the LFW dataset are processed by the same
pipeline as the training dataset and normalized to 128 x 128.

For evaluation, the face data is divided into 10 folds, each of
which contains different identities and 600 face pairs. LFW defines
two settings for training and testing: restricted and unrestricted.
In the restricted setting, the predefined image pairs are fixed by
the dataset authors (each split provides 5400 pairs for training
and 600 pairs for testing). In the unrestricted setting, the
identities of the people within each fold are available, so the
number of training pairs is allowed to be much larger.

As Fig. 4 shows, the Max-Feature-Map network converges more slowly
than the ReLU network, due to the complexity of the activation and
the randomness of the initial parameters. However, as training
progresses, the validation accuracy of the Max-Feature-Map net
surpasses that of the ReLU net.

We evaluate our deep model using cosine similarity and the ROC
curve. The results³ are shown in Table 2. The result at the EER
point on LFW reaches 97.77%, which outperforms DeepFace [15],
DeepID2 [11] and WebFace [16] under the unsupervised setting⁴ for a
single net.
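
As an illustration of this kind of evaluation, the sketch below
scores pairs of feature vectors with cosine similarity and reports
the accuracy at the threshold where the false accept and false
reject rates are closest. It is a simplified stand-in, not the
evaluation script used for Table 2; the function names and the toy
scores are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def accuracy_at_eer(scores, labels):
    """scores: one similarity per pair; labels: True for matching pairs.

    Scans every observed score as a threshold (accept if score >= t) and
    returns (threshold, accuracy) where |FAR - FRR| is smallest.
    """
    positives = sum(1 for l in labels if l)
    negatives = len(labels) - positives
    best = None
    for t in sorted(set(scores)):
        false_reject = sum(1 for s, l in zip(scores, labels) if l and s < t)
        false_accept = sum(1 for s, l in zip(scores, labels) if not l and s >= t)
        gap = abs(false_accept / negatives - false_reject / positives)
        if best is None or gap < best[0]:
            accuracy = 1.0 - (false_reject + false_accept) / len(labels)
            best = (gap, t, accuracy)
    return best[1], best[2]


# Toy example: three matching and three non-matching pairs.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]
threshold, accuracy = accuracy_at_eer(scores, labels)
print(threshold, accuracy)
```

In practice the similarities would be computed from the
256-dimensional fc1 features of the two images in each LFW pair.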


² http://vis-www.cs.umass.edu/lfw/

³ The model and configuration are released on my Github.

⁴ In the unsupervised setting, the model is not trained on LFW in a
supervised way.
Figure 4. Comparison of validation accuracy during CNN training with the ReLU and Max-Feature-Map activation functions.


Table 2. The performance of our deep face model and compared
methods on LFW.

Method                 #Net   Accuracy   Protocol
---------------------  -----  ---------  -------------
DeepFace               1      95.92%     unsupervised
DeepFace               1      97.00%     restricted
DeepFace               7      97.35%     unrestricted
DeepID2                1      95.43%     unsupervised
DeepID2                2      97.28%     unsupervised
DeepID2                4      97.75%     unsupervised
DeepID2                25     98.97%     unsupervised
WebFace                1      96.13%     unsupervised
WebFace+PCA            1      96.30%     unsupervised
WebFace+Joint Bayes    1      97.30%     unsupervised
WebFace+Joint Bayes    1      97.73%     unrestricted
Our model (ReLU)       1      97.45%     unsupervised
Our model (MFM)        1      97.77%     unsupervised


4. Conclusions 

In this paper, we proposed a deep convolutional network for
learning a robust face representation. We use the Max-Feature-Map
activation function to learn a compact low-dimensional face
representation. The result on LFW is 97.77%, which, as far as we
know, is state-of-the-art performance under the unsupervised
setting for a single net.